KTH ROYAL INSTITUTE OF TECHNOLOGY


Degree Project in Computer Science and Engineering 


Second cycle, 30 credits 


Vision transformer anomaly 
detection on mask writer servo 
logs 


A study of vision transformer for anomaly detection in 2D 
servo logs of industrial mask writers 


MATTIAS ARVIDSSON 


Stockholm, Sweden, 2024 


Degree Programme in Computer Science and Engineering 
Date: June 14, 2024 


Supervisors: Raghav Bongole, Stefan Fu 
Examiner: Markus Flierl 
School of Electrical Engineering and Computer Science 
Host company: Mycronic AB 
Swedish title: Vision transformer anomalidetektering på servo loggar från 
fotomaskritare 
Swedish subtitle: En studie om vision transformers för anomalidetektering av 2D 
servologgar från industriella maskritare 


© 2024 Mattias Arvidsson 


Abstract 


Anomaly detection is the task of identifying data that deviates from a normal
set of data. When dealing with large data, a common approach is to split
the data into same-sized patches to obtain a feasible and uniform model input.
A problem that occurs when using convolutions is that a larger patch size
degrades detection performance on smaller anomalies, while a smaller patch
size risks missing anomalies that extend beyond a single patch. In
this thesis we investigate vision transformers for this anomaly detection task,
whose self-attention makes them a promising candidate for this problem. After
a study of the current field, we chose to implement InTra, an inpainting
transformer that is trained to reconstruct an obstructed part of the image
from its surroundings. A fourfold increase in patch size was achieved with
performance similar to its convolutional counterparts; further increases were
limited by the available computational resources. InTra not only keeps up at
this larger patch size but even improves local performance in some aspects by
detecting known local anomalies that the convolutional models cannot. The
increased patch size has not shown any improvements on larger anomalies, but
we see larger patch sizes as a great first step toward this goal. Additionally,
this thesis investigates different training losses for InTra and their effect on
performance. The standard Mean Squared Error (MSE) shows good results for
InTra but is improved by combining it with Multiple Scale Gradient Magnitude
Similarity (MSGMS). Statistical significance cannot be shown for all
comparisons of the joint error versus MSE alone; therefore, future studies
are needed to improve the confidence in MSGMS performance improvements
on surface defect detection tasks.


Sammanfattning (Swedish summary)


Anomaly detection is about identifying data that deviates from a normal
set of data. When handling large data, a common approach is to split the
data into equally sized patches to make it more manageable and better suited
as model input. A problem that arises when using convolutions is that a
larger patch size degrades detection performance on smaller anomalies, while
a smaller patch size risks missing larger anomalies that extend beyond the
size of the patch. In this master's thesis we investigate vision transformers
for this anomaly detection problem; the transformer is a promising candidate
because of its self-attention mechanism. After a literature study of the
field, we chose to implement InTra, a transformer that is trained to restore
a covered part of the image using nearby parts of the image. A fourfold
increase in patch size was achieved with performance similar to earlier
convolutional models with a smaller patch size. InTra not only achieved
similar performance but also improves local performance in some areas and
manages to identify known anomalies that the convolutional models do not.
However, the larger patch size does not increase performance on larger
anomalies, but we see good performance at a larger patch size as a good first
step toward this goal. In addition, this thesis investigates different
training losses for InTra and their effect on performance. An interesting
finding is the promising results of Multiple Scale Gradient Magnitude
Similarity (MSGMS), which performs worse on its own but gives good results
when used in combination with the usual loss function, Mean Squared Error
(MSE).


Chapter 1


Introduction 


This thesis pertains to the topic of anomaly detection, more specifically 
anomaly detection in 2D servo logs from photomask writers. This has been 
done previously with both statistical methods and deep learning methods, 
although the latter has been limited to convolutional methods. These have
shown promising results, but convolutional methods have their limitations.
Therefore, this thesis aims to adapt the Vision Transformer (ViT) [1] for this
anomaly detection task.

Transformers [2] first showed great results within Natural Language
Processing (NLP) and were adapted by Dosovitskiy et al. for image processing
tasks [1]. ViT has been studied for the task of anomaly detection, and this
thesis aims to expand the knowledge surrounding ViT's use for anomaly
detection.

In addition, this thesis has a connection to the field of Surface Defect
Detection (SDD), which is mainly used in a manufacturing context to find
defects in products. Methods for accomplishing this task range from
tried and tested computer vision algorithms to the more recent deep learning
approaches [3]. A distinction that needs to be made is that SDD uses images
taken by a camera as data, whereas the data in this project is two-dimensional 
servo logs. 


1.1 Background


1.1.1 Anomaly detection 


Anomaly detection refers to the problem of finding inconsistent, abnormal, 
anomalous, or non-conforming data in a dataset. It has a wide variety 
of application fields and can provide significant and usually actionable
information. For example, unusual data traffic in a network could indicate an
attack, anomalous medical images might indicate the presence of tumors, or
anomalous outputs from a spacecraft sensor could indicate faulty components
[4]. For a broad survey of the general field of anomaly detection, see [4].

With the rise of deep learning, there have been several methods developed 
to try and fully utilize its potential. Most methods aim to learn the feature 
representation of the data to distinguish outliers more proficiently, in ways 
humans or rule-based methods cannot. For a recent survey of deep learning in 
anomaly detection, see [5]. 


1.1.2 2D servo logs 


Mycronic develops and manufactures photomask writers. Photomasks are 
used in photolithography, which is a method for transferring the geometrical 
pattern of a photomask onto another substrate at a much higher speed than 
creating the masks themselves. This is the basis for producing integrated 
circuits, as well as modern displays. All photomasks for advanced flat panel 
displays are produced by mask writers developed and built by Mycronic [6]. 
For an overview of the process, see the introductory chapter in a PhD thesis 
from KTH [7]. 

The mask writers produce servo logs continuously during use, which 
contain all the motions and motion-related variables of the mask writer. The 
servo logs are analyzed for many purposes: detecting errors in previous and
current operations, monitoring the mask writers' performance, and general
troubleshooting. Machine learning is an additional step in this analysis
that aims to reduce and assist manual labor, identify more complex
relationships in the data, and, in combination with statistical methods, find
anomalies that other methods cannot.


1.2 Motivation 


Mycronic has investigated variations of convolution-based methods,
such as autoencoders, Generative Adversarial Networks (GANs), and U-Net, for
the detection of anomalies in the data. These have shown promising results
but currently face a tradeoff between performance on local and global
anomalies. This is due to the size of the servo logs; a large job can be
2000 x 7000 pixels. To make the data more manageable as model input and to
handle inputs of varying size, the logs are divided into smaller patches that
are subsequently fed as input to the models. The tradeoff is that larger
patch sizes hurt performance on local anomalies, while smaller patches risk
missing anomalies that extend beyond a single patch.
Transformer-based models differ from convolutional ones in their
self-attention mechanism: the self-attention layers are global, as opposed to
the local nature of convolutions [1]. Because of this mechanism, ViT is a
promising candidate for addressing this tradeoff.
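
The patching step described above can be illustrated with a minimal sketch. It
assumes a servo log held as a plain list of rows; the function names, the
zero padding on the bottom and right edges, and the patch sizes used below are
illustrative assumptions and are not taken from the Mycronic pipeline.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def patch_grid_shape(height, width, patch_size):
    """Number of patch rows and columns needed to cover a height x width log."""
    return ceil_div(height, patch_size), ceil_div(width, patch_size)


def split_into_patches(log, patch_size, pad_value=0.0):
    """Split a 2D log (list of rows) into square, non-overlapping patches.

    The log is padded on the bottom and right with pad_value so that every
    patch has the same size (the padding strategy is an assumption).
    Returns a dict mapping (top, left) pixel offsets to patches.
    """
    height, width = len(log), len(log[0])
    rows, cols = patch_grid_shape(height, width, patch_size)
    padded_h, padded_w = rows * patch_size, cols * patch_size

    # Pad each row to the right, then append fully padded rows at the bottom.
    padded = [list(row) + [pad_value] * (padded_w - width) for row in log]
    padded += [[pad_value] * padded_w for _ in range(padded_h - height)]

    patches = {}
    for top in range(0, padded_h, patch_size):
        for left in range(0, padded_w, patch_size):
            patches[(top, left)] = [
                row[left:left + patch_size]
                for row in padded[top:top + patch_size]
            ]
    return patches
```

For the 2000 x 7000 example, a hypothetical patch size of 64 gives a grid of
32 x 110 patches, whereas a patch size of 256 gives 8 x 28 patches, so each
larger patch covers far more context at once.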

The long-term goal is to find anomalies early in the manufacturing process, 
which would reduce rework and increase efficiency and profitability. 


1.3 Research question 


To focus the investigations of this thesis report, a main research question is 
formulated with additional sub-questions to further detail the research goals. 

What performance can be achieved using vision transformers for anomaly 
detection on servo logs? 


* Can it be useful by outperforming existing convolution-based methods in
some aspect?


* Can vision transformers help with the local-global tradeoff? 


* What are the best training losses for this task? 


1.4 Delimitations 


The servo logs of the mask writer contain many logged variables, with many 
possible anomalies. This project is only concerned with the subset that is 
already annotated. 


1.5 Structure of the thesis 


Chapter 2 presents relevant background information about anomaly detection, 
display manufacturing, transformers, and other deep learning methods for 
anomaly detection. Chapter 3 presents the methodology and model used 
to solve the problem. Chapter 4 presents the results, and Chapter 5
discusses them, draws conclusions, and proposes future work.


Chapter 2


Background 


2.1 Anomaly detection 


Anomaly detection is a challenging task that has important applications in 
many fields, including fraud detection, intrusion detection, medical diagnosis, 
and quality control. Deep learning has emerged as a powerful tool for anomaly 
detection, owing to its ability to learn complex representations of data and 
identify patterns that are difficult to detect using traditional methods. 

Before delving deeper, it is important to define the different types of
anomalies. Usually, anomalies are divided into three subcategories: point
anomalies, contextual anomalies, and collective anomalies. Point or local
anomalies are individual data instances that differ from the rest of the data.
Contextual anomalies are data instances that are anomalous in the context of
the rest of the data but do not constitute anomalies in and of themselves.
An example would be severely cold temperatures in the summer;
cold temperatures are not anomalous in and of themselves, but in the context
of summer, they are. Collective anomalies occur when a collection of
correlated data instances is anomalous relative to the rest of the data [4].

Further details for specific anomaly definitions for the dataset used in this 
project can be seen under Section 3.1. 


2.1.1 Unsupervised anomaly detection 


There are some prevalent problems in Anomaly Detection (AD), namely the
rarity and class imbalance of abnormal data. Anomalies are rare instances
compared to normal observations, which makes it infeasible to collect and
annotate a large enough labeled dataset for supervised learning [8]. Even
if one has the time and resources to do this, it still does not solve the next
problem: anomalies are everything that differs from the normal class.
Therefore, it is in theory impossible to enumerate all possible abnormal
cases [9].

An attempt to solve this is unsupervised anomaly detection, where the
training set is unlabeled and mostly normal, and the rate of abnormalities
is so low that the set can in practice be treated as only normal [10].
Methods in unsupervised settings try to learn a normality that is presumed to
be clustered; anomalies are then samples that differ enough from the
estimated normality [9]. Some examples of this are shown in Section 2.1.4.


2.1.2 Defect detection versus semantic anomaly detection


Another important distinction is what the task deems an anomaly. Imagine a
dataset consisting of varying types of cars; what would constitute an anomaly?
Broadly, two types of anomalies can be distinguished. The first is, for
example, a truck, a motorcycle, or something drastically different such as a
cat. This is usually referred to as semantic anomaly detection, in which the
normal and abnormal samples differ in semantic meaning. The other type is
defect detection, where normal and abnormal samples differ in a local area
but are semantically the same. An example of this could be a scratch, a dent,
or missing parts of the car [11]. Examples of defects can be seen in Figure 2.1.


2.1.3 Benchmark datasets and standard metrics 


The current de facto benchmark for industrial anomaly detection in images 
is the MVTEC-AD dataset [12], which consists of 15 categories of images 
depicting both objects and textures. Anomalous images are annotated on 
an image level and pixel-wise. Some examples of normal, abnormal, and 
annotations can be seen in Figure 2.1. It consists of defect anomalies, mainly 
local but some contextual anomalies such as missing transistors, etc. 

Figure 2.1: Example images of the MVTEC-AD dataset. Top row: normal
images, middle row: anomalous images, and bottom row: anomalous regions
annotated [12].

Figure 2.2: Visual explanation of components behind the receiver operating
characteristic (ROC); the horizontal axis shows P(FP), the FPR, from 0% to
100%. CCSA3 License [13]

The authors of MVTEC-AD propose some metrics to evaluate the
performance of AD models. Once again, the unbalanced distribution of
normal and abnormal data becomes a problem in evaluation. If the anomalous
cases are <1%, an accuracy of more than 99% can be achieved by a model
returning normal no matter the input. The common solution in the
literature is to use the Receiver Operating Characteristic (ROC) as a metric
instead [12]. An illustration of ROC is shown in Figure 2.2. ROC plots the
True Positive Rate (TPR) versus the False Positive Rate (FPR). To get a single
value, similar to accuracy, the area under the plotted ROC curve is
calculated, also referred to as the ROC Area Under the Curve (AUC).
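
The accuracy problem and the ROC AUC can be illustrated with a short sketch.
It computes the AUC through its rank interpretation, the probability that a
randomly chosen anomalous sample receives a higher score than a randomly
chosen normal one, with ties counted as one half. The function names and the
toy data are assumptions made for illustration only.

```python
def accuracy(labels, predictions):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(int(y == p) for y, p in zip(labels, predictions))
    return correct / len(labels)


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of anomalous and normal scores.

    labels: 1 for anomalous, 0 for normal. scores: higher means more
    anomalous. Quadratic in the number of samples, which is fine for a sketch.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# 0.5% anomalies and a model that always answers "normal" (score 0).
labels = [0] * 995 + [1] * 5
always_normal_accuracy = accuracy(labels, [0] * 1000)
always_normal_auc = roc_auc(labels, [0.0] * 1000)
```

In this toy setting the constant model reaches an accuracy of 0.995 while its
AUC is 0.5, the value of an uninformative detector, which is why AUC is the
preferred metric under heavy class imbalance.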

There are some drawbacks to ROC, especially for smaller anomalies, since
it weighs every pixel the same. This leads to a bias towards larger anomalies.
To counteract this, the literature suggests the use of Per Region Overlap
(PRO). To calculate PRO, anomaly scores are thresholded to obtain a binary
prediction for each pixel. Following that, the percentage of correctly
predicted pixels for each annotated region in the ground truth is recorded.
This is repeated for different threshold values; a convention is to include
thresholds up to an FPR of 0.3. The average over the thresholds is the final
PRO metric. The advantage is that this weighs defect regions of varying size
equally [14].
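
The PRO computation described above can be sketched as follows. This is a
simplified illustration that averages over a supplied list of thresholds, as
in the description; reference implementations may instead integrate the PRO
curve over the FPR axis. The use of 4-connectivity to define regions and all
function names are assumptions of this sketch.

```python
def connected_regions(mask):
    """Return 4-connected regions of truthy cells as lists of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            stack, region = [(r, c)], []
            seen[r][c] = True
            while stack:  # iterative flood fill
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            regions.append(region)
    return regions


def per_region_overlap(scores, ground_truth, threshold):
    """Mean fraction of each annotated region that is flagged as anomalous."""
    overlaps = [
        sum(scores[y][x] >= threshold for y, x in region) / len(region)
        for region in connected_regions(ground_truth)
    ]
    return sum(overlaps) / len(overlaps)


def false_positive_rate(scores, ground_truth, threshold):
    """Fraction of normal pixels that are flagged as anomalous."""
    normal = [
        (y, x)
        for y, row in enumerate(ground_truth)
        for x, is_anomalous in enumerate(row)
        if not is_anomalous
    ]
    return sum(scores[y][x] >= threshold for y, x in normal) / len(normal)


def mean_pro(scores, ground_truth, thresholds, max_fpr=0.3):
    """Average PRO over the thresholds whose FPR does not exceed max_fpr."""
    kept = [
        per_region_overlap(scores, ground_truth, t)
        for t in thresholds
        if false_positive_rate(scores, ground_truth, t) <= max_fpr
    ]
    return sum(kept) / len(kept)
```

With one fully detected single-pixel region and one half-detected four-pixel
region, the PRO at that threshold is 0.75, whereas the pixel-wise true
positive rate would be 3/5 = 0.6; the small region counts as much as the
large one.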

In the world of NLP, human feedback is a common practice to attempt 
to translate qualitative data into more comparable quantitative data. This is 
typically achieved by either ranking different aspects of the qualitative data or 
ranking the qualitative data in correlation to other data of the same kind [15]. 

Heatmaps are a type of qualitative data that visualize the different
magnitudes of the data as different colors. See Figure 4.1 for an example.
Interesting behaviors and relevant information can be observed in individual 
samples, such as similar background patterns between the different models 
or highlighted anomalies different from the ground truth, but ranking by an 
expert is a way to draw more general conclusions about all the heatmaps and 
therefore the model as a whole [16]. 


2.1.4 Methods for deep-learning based anomaly detection


To understand how anomaly detection works more precisely, one possible
taxonomy divides anomaly detection methods into four categories:
reconstruction, classification, probabilistic, and distance-based [9].


2.1.4.1 Reconstruction based 


Reconstruction-based anomaly detection is an early method within neural 
network models trained for anomaly detection [17] and has become one of 
the most common approaches [10] in recent years. 

Its main objective is to train a model to competently reconstruct normality 
and be worse at reconstructing anomalous data. In its simplest form, 
this is done using an encoder-decoder model subsequently referenced as 
Autoencoder (AE).

Let $\phi_\theta$ be the model that maps the data space $X \to X$ onto itself,
consisting of $\phi_e : X \to Z$, the encoder to the latent space $Z$, and
$\phi_d : Z \to X$, the decoder, with parameters $\theta$. The most basic
anomaly score is then the Euclidean distance between the initial data and the
reconstruction. Given an unsupervised setting, and therefore unlabeled data
$x_1, \ldots, x_n \in X$, the function to minimize is given by:

$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \phi_\theta(x_i) \rVert^2 + R_\theta$$

where $R_\theta$ is any regularization that prevents the optimal model, in its
absence, from being the identity function.
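
To make the reconstruction-based anomaly score concrete, the following toy
sketch uses a fixed linear encoder and decoder in place of a trained network:
the encoder projects a 2D point onto one direction (a one-dimensional latent
space) and the decoder maps it back. The direction and the sample points are
hypothetical and chosen purely for illustration; a real autoencoder would
learn its parameters by minimizing the objective above.

```python
def encode(x, direction):
    """Encoder: project x onto a unit-length direction (1D latent code)."""
    return sum(xi * di for xi, di in zip(x, direction))


def decode(z, direction):
    """Decoder: map the latent code back into the data space."""
    return [z * di for di in direction]


def anomaly_score(x, direction):
    """Squared Euclidean distance between x and its reconstruction."""
    reconstruction = decode(encode(x, direction), direction)
    return sum((xi - ri) ** 2 for xi, ri in zip(x, reconstruction))


# Hypothetical "learned" direction: normal data is assumed to lie along it.
direction = (0.6, 0.8)
score_on_normal = anomaly_score((3.0, 4.0), direction)   # lies on the line
score_on_outlier = anomaly_score((4.0, -3.0), direction)  # orthogonal to it
```

Points that follow the structure captured by the bottleneck are reconstructed
almost perfectly and receive a score near zero, while points that do not are
reconstructed poorly and receive a high score.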

In its simplest forms, the AE is deterministic. The most explored type 
of stochastic AE [9] is the Variational Autoencoder (VAE) [18]. In general, 
it works by learning the features of a distribution of the latent space; this in 
turn allows the decoder to generate new data from the latent space that was 
not encoded by the encoder. Due to the inherent structure of AE, they have a 
problem of over-generalizing the reconstruction [19]. This cannot simply be 
fixed by having stricter regularizations, rather structural changes to the training 
have to be introduced [20]. One such method is the Masked Autoencoder 
(MAE) [21], which masks a set of patches and trains the autoencoder to 
reconstruct the missing pixels. 

Another important part of the quality of reconstruction is the loss function
used to minimize the reconstruction error during training [22]. The most
common and simple error functions are pixel-wise ones, such as Mean
Squared Error (MSE) and Mean Absolute Error (MAE). In an attempt to
mimic human perception when extracting structural information, Wang et al.
introduced the Structural Similarity Index Measure (SSIM) [23]. Bergmann et
al. have also shown its potential in reconstruction-based anomaly detection
[24]. A more recent attempt to mimic human perception and extend the ideas of
SSIM is Gradient Magnitude Similarity (GMS) [25]. Zavrtanik et al. showed
that it works in anomaly detection and extended it to multiple scales by
taking the average GMS map over different scales [26].
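
For reference, the losses mentioned above can be written out as follows. This
is a sketch under assumptions: $h_x$ and $h_y$ denote $3 \times 3$ Prewitt
filters, $*$ is convolution, $c$ is a small positive stability constant,
$I_s$ is the image at scale $s$ with $N_s$ pixels, and the number of scales
$S$ and the weight $\lambda$ of the joint loss are hypothetical
hyperparameters rather than values taken from this thesis.

```latex
\begin{align*}
  g(I) &= \sqrt{(I * h_x)^2 + (I * h_y)^2} \\
  \mathrm{GMS}(I, \hat{I}) &=
    \frac{2\, g(I)\, g(\hat{I}) + c}{g(I)^2 + g(\hat{I})^2 + c} \\
  L_{\mathrm{MSGMS}}(I, \hat{I}) &=
    \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N_s} \sum_{p=1}^{N_s}
    \bigl( 1 - \mathrm{GMS}(I_s, \hat{I}_s)_p \bigr) \\
  L &= L_{\mathrm{MSE}} + \lambda\, L_{\mathrm{MSGMS}}
\end{align*}
```

The GMS map equals 1 where the gradient magnitudes of the image and its
reconstruction agree, so the loss penalizes differences in edge structure
rather than in raw pixel intensity.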

Given its popularity, the literature has discovered an interesting caveat to
reconstruction-based anomaly detection: the reconstruction can become
too good for anomaly detection purposes, in that the model generalizes too
well and starts reconstructing anomalies decently as well [27][28]. One of the
studied causes is the training loss; as the data gets more complex, MSE has a
greater risk of generalizing between normal and abnormal samples. One
alleviating method is to increase the complexity of the training loss by
adding other losses or adversarial training techniques [29].


2.1.4.2 Classification based


Classification-based models, also known as one-class classification or single- 
class classification models, use a discriminative approach to anomaly 
detection. Instead of learning the parameters of the normal dataset, they aim to 
learn a decision boundary that corresponds with the normal dataset and results 
in low error on unseen data. 

It is a particularly difficult problem since it is a binary classification
with a major imbalance in the data, in most cases with only normal samples
available. With this imbalance, the goal is to find an acceptable middle
ground between false positives and false negatives. More false positives
will occur with a stricter decision boundary, and with a more relaxed
boundary, the false negative rate tends to increase [9].

With deeper models, the decision boundary is usually described as a
hypersphere or n-sphere, where n is the dimension of the decision boundary's
sphere. To train these models, the distance of all training data to the
center of the hypersphere is minimized, with added regularizations [30].


2.1.4.3 Probabilistic based 


Probabilistic-based methods detect anomalies by estimating the probability 
distribution of normal data. On a rudimentary level, it could be fitting 
the normal data to a multivariate Gaussian and at test time taking the log- 
likelihood of the test data point in conjunction with some threshold for 
predicting abnormality [9]. 
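
The rudimentary Gaussian approach can be sketched as follows. For brevity the
sketch assumes a diagonal covariance matrix (independent features), which is
a special case of the multivariate Gaussian mentioned above; the function
names, the toy data, and the threshold are illustrative assumptions.

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(samples):
    """Estimate per-feature mean and variance from normal training samples."""
    columns = list(zip(*samples))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col, mu) for col, mu in zip(columns, means)]
    return means, variances


def log_likelihood(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )


def is_anomalous(x, means, variances, threshold):
    """Flag x as anomalous when its log-likelihood falls below threshold."""
    return log_likelihood(x, means, variances) < threshold


# Toy "normal" data and a hypothetical threshold chosen by hand.
normal_samples = [(0.0, 10.0), (1.0, 12.0), (-1.0, 8.0), (0.0, 10.0)]
means, variances = fit_diagonal_gaussian(normal_samples)
flag_far = is_anomalous((5.0, 10.0), means, variances, threshold=-10.0)
flag_near = is_anomalous((0.5, 11.0), means, variances, threshold=-10.0)
```

In practice the threshold would be chosen on held-out data, for example to
reach a target false positive rate.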

Two of the most widely used methods for AD [10], VAE [18] and GAN
[31], both suffer in performance when used as a purely probabilistic method
for AD [9]. To compensate for this, reconstruction and classification
strategies are used in tandem [32], [33]. One purely probabilistic method is
normalizing flows [34]. It is distinguished by the latent samples having the
same dimension as the input data and by the entire network being invertible.
The benefit is that the density of the data can be computed exactly through a
change of variables and used for anomaly detection [35].
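
The change of variables referred to above can be written out explicitly. The
notation is an assumption of this sketch: $f : X \to Z$ denotes the
invertible network, $p_Z$ the chosen latent density (typically a standard
Gaussian), and $p_X$ the resulting density of the data.

```latex
\begin{align*}
  p_X(x) &= p_Z\bigl(f(x)\bigr)\,
            \left| \det \frac{\partial f(x)}{\partial x} \right| \\
  \log p_X(x) &= \log p_Z\bigl(f(x)\bigr)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
\end{align*}
```

A natural anomaly score is then the negative log-likelihood
$-\log p_X(x)$, which is high for samples the model considers unlikely.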


2.1.4.4 Distance based 


Distance-based methods differ because of the absence of a loss function to
minimize. Most methods have short or nonexistent training; rather, the test
data is compared to the training data as is. The anomaly detection is, as the
name suggests, based on the distance of the test data from the training data.
Typical algorithms are k-nearest neighbor, local outlier factor, and isolation
forests [9].
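
As a minimal sketch of the k-nearest neighbor variant, the anomaly score of a
test sample can be taken as its mean Euclidean distance to the k closest
training samples. The function name, the choice of the mean as aggregation,
and the toy data are assumptions made for illustration.

```python
import math


def knn_anomaly_score(query, training_data, k=3):
    """Mean Euclidean distance from query to its k nearest training samples."""
    distances = sorted(math.dist(query, sample) for sample in training_data)
    return sum(distances[:k]) / k


# No training step: the "model" is simply the stored normal data.
training_data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
score_inside = knn_anomaly_score((0.5, 0.5), training_data)
score_outside = knn_anomaly_score((10.0, 10.0), training_data)
```

A sample that lies among the training data receives a low score, while a
sample far from all training data receives a high one.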

A simple yet effective way to improve these methods is the use of pre- 
trained networks. The idea is that the pre-trained network and the features it 
extracts from the data are relevant to detecting anomalies, or that the network 
would help cluster the normal and abnormal data. At test time, the test data is 
propagated through the network, and its output is compared with the "training" 
data available using the same algorithms as before. One such example is Deep 
Nearest Neighbors [36]. 


2.2 Photomasks 


Photomasks serve as a crucial part of photolithography, a technique used in
the manufacturing of flat panel displays and integrated circuits. Photomasks
are patterns produced in mask shops by a mask writer, which most commonly uses
a laser to expose a photosensitive resist on a glass substrate. Subsequently,
this pattern functions as a template for higher-capacity manufacturing. All
photomasks for advanced flat panel displays are produced by mask writers
developed and built by Mycronic, which also supplies the ground truth data for
this thesis project.

Figure 2.3 illustrates the different steps in photomask production, which are
more extensively described in a KTH PhD thesis [7]. In short, a large and
expensive glass plate is coated with a chromium layer and a photosensitive
chemical resist. The substrate is exposed with a laser in the mask writer.
The substrate is then developed through a chemical process, inspected for
possible flaws, and possibly repaired. Subsequently, the pattern in the
developed photoresist is transferred to the chromium layer through chemical
etching, and another round of inspections is conducted.

This is a very expensive process, as the most expensive masks cost more
than 100,000 USD to produce [7]. Given this high cost and that the final
product is used as a master in an even more costly process, it is easy to see 
why there are so many inspections and why new effective inspections could 
be worthwhile. 

In an initiative by Mycronic, a new inspection step is introduced in
conjunction with exposure [6], called Inspection 1.5 in Figure 2.3. This
process involves logging the position and related variables of all axes and
servos from the mask writer. The goal is to detect anomalies and unusual
behavior in these logs, which are indicative of defects in the substrate. A
benefit of this approach is that it operates in real time within the
manufacturing process and is able to detect defects earlier than previously
possible, which allows for earlier and easier repairs. Additionally, it
serves as a valuable troubleshooting tool for proactively preventing similar
defects in the future. Kurian [6] outlines two initial approaches for this
purpose. The first is a rule-based use of classical statistical analysis with
domain-knowledge-informed transformations of the data, and the second is an
AE for reconstruction-based anomaly detection.

Figure 2.3: The photomask production process as described by [7], with added
inspection 1.5 step [6] highlighted. Steps legible in the figure: Inspection
1, Pellicle mount, Final inspection, Mask shop.


2.3 Vision Transformer 


The first transformer model, introduced by Vaswani et al. [2], has become the
de facto standard for NLP tasks. Its novelty is the use of solely
self-attention, without the recurrence or convolutions that were the standard
at the time [2]. Classically, transformers operate on sequences of words,
usually referred to as tokens, and apply the aforementioned self-attention to
them. Attention can be seen as a function that maps a query and a set of
key-value pairs to an output. The query (Q), keys (K), values (V), and output
are all vectors, with query and key vectors of dimension $d_k$ and value
vectors of dimension $d_v$. The attention calculation can be seen in
Equation 2.1.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (2.1)$$

There are two commonly used types of attention, additive and dot-product
attention. The attention of Vaswani et al. is identical to dot-product
attention, except for the scaling factor $1/\sqrt{d_k}$, where $d_k$ is the
dimension of the queries and keys. Theoretically, the two attention types are
similar; in practice, dot-product attention is faster and more space
efficient due to highly optimized matrix multiplication code.
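
Equation 2.1 can be illustrated with a small, dependency-free sketch of scaled
dot-product attention. Queries, keys, and values are given as lists of
vectors; the function names are assumptions of this sketch, and a practical
implementation would use batched matrix operations instead of Python loops.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[0.0, 0.0], [100.0, 0.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
```

The first query is equally similar to both keys and therefore returns the
average of the two values, while the second query matches the first key
strongly and returns almost exactly the first value.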

The next important component of the attention mechanism of Vaswani et al. is
multi-head attention. Instead of performing only a single attention function,
they linearly project the queries, keys, and values h times, h being the
number of attention heads, and perform attention on each projection. The
multiple heads allow the transformer to attend to information
from different representation subspaces at different positions, something that
would not be possible with a single attention head due to averaging. For ease
of understanding, one can think of the different attention heads as learning
different aspects of the same input; an NLP example would be one head
focusing on co-reference resolution and another on possible negations of the
same token.

It is easy to see why pixel-wise attention, with its quadratic dot-product
cost, is problematic when applying Transformers to images. A typical
256 x 256 pixel image would result in $(256 \times 256)^2 = 256^4$, or about
4.3 billion, pairwise operations, which is infeasible.
Initial attempts to solve this included kernels of local attention with depth,
which is similar to convolutional nets and still lacks the global attention of
the classical Transformer. Dosovitskiy et al. adapt the transformer to images
by applying global attention to the image segmented into patches of 16 x 16
pixels, flattening these patches, and linearly projecting them. The flattened
and projected patches are referred to as patch embeddings, and position
embeddings are added to retain positional information about the patches.
These patch and position embeddings are then fed to the transformer encoder
as input [1]. The architecture of ViT and the original transformer encoder can
be seen in Figure 2.4.
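
The arithmetic behind this argument, and the flattening of patches into
tokens, can be sketched as follows. The function names are assumptions of
this sketch, image dimensions are assumed to be divisible by the patch size,
and the learned linear projection and position embeddings are omitted.

```python
def attention_pairs(height, width, patch_size=1):
    """Return (number of tokens, number of token pairs) for global attention."""
    tokens = (height // patch_size) * (width // patch_size)
    return tokens, tokens * tokens


def flatten_patches(image, patch_size):
    """Split a 2D image into non-overlapping patches and flatten each one.

    Each flattened patch is one token; in ViT it would next be linearly
    projected and combined with a position embedding.
    """
    tokens = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return tokens


pixel_tokens, pixel_pairs = attention_pairs(256, 256)
patch_tokens, patch_pairs = attention_pairs(256, 256, patch_size=16)
```

For a 256 x 256 image, pixel-wise attention involves 65,536 tokens and
4,294,967,296 token pairs, whereas 16 x 16 patches reduce this to 256 tokens
and 65,536 pairs.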

ViT performs well in comparison to its convolutional-based counterparts 
and attains excellent results, showing that the previous heavy reliance on 
convolutions is not necessary and that pure transformer-based networks can 
work well on image tasks [1]. The consensus seems to be that transformer and 
convolutional-based models perform similarly on most tasks. 

Figure 2.4: The architecture of the original ViT as shown in [1]

The original ViT's loss function can be seen as non-convex, whereas that of
ResNet, a common convolutional model, is near-convex. This leads to
worse performance on smaller datasets, as the non-convex loss disrupts
training. Large datasets seem to help ViT learn strong representations by
convexifying the loss, which leads to better performance [37]. A common
approach nowadays is to fine-tune a pre-trained ViT on a smaller, more
task-specific dataset [38]. Other approaches introduce a new kind of loss; one
such example is masking, as in the aforementioned MAE [21].

An attempt to alleviate the high computational cost of ViT is the Swin 
transformer, using a shifted windows approach to cut the computational cost 
of ViT from quadratic to linear with respect to image size [39]. 


2.4 Related work 


2.4.1 Anomaly detection using ViT 


The initial implementation of ViT was for classification, but ViT is also 
applicable to anomaly detection. The authors of VT-ADL use a vision 
transformer as an encoder and a Convolutional Neural Network (CNN) as 
a decoder, additionally adding a Gaussian mixture density network to try 
and improve anomaly localization [40]. AnoViT similarly uses a vision 
transformer as the encoder and a CNN as the decoder [41]. They are both 
reconstruction-based methods, and both suffer from large network size and 
long training time [42].

A more purely transformer-based approach is that of Cohen and Avidan with
their probability-based model Transformaly [11], which uses a ViT pre-trained
on a larger dataset to fine-tune an uninformed student network. A Gaussian is
fitted to both feature spaces, and the anomaly score is the product of the
likelihoods of the data point under both. Transformaly focuses on semantic
anomaly detection, which is discussed in Section 2.1.2. This work is heavily
inspired by the CNN-based approach of Bergmann et al. [43], which also uses a
pre-trained network and an uninformed student network.

Continuing with pure transformer-based models, Pirnay and Chai
[44] introduce InTra, an inpainting transformer for anomaly detection. It is a
reconstructive method that draws inspiration from masked autoencoders [21]
by covering a patch of the image and training the transformer to reconstruct
the covered patch using the surrounding neighborhood of patches. The
architecture consists of stacked transformer blocks with long residual
connections that connect the blocks and fully utilize the information from
all blocks. To mitigate the problem of patches in training being too similar,
which results in near-uniform softmax-weighted sums in the attention
mechanism, an adapted version was created. Replacing the weights for q and k
in the attention mechanism with two single-hidden-layer Multilayer Perceptrons
(MLPs) heavily increased the number of model parameters, but the authors saw
improved results that outweighed that cost. A drawback of keeping the
model fully transformer-based is the inference time, since an image is fully
reconstructed by combining reconstructed patches from the model output.

The aforementioned Swin transformer has been adapted to industrial
anomaly detection with MSTUNet [45]. MSTUNet is notable for introducing its
own anomaly generator, which makes it a self-supervised method rather than
an unsupervised one. This alleviates problems with small datasets but
introduces a bias towards the generated abnormal samples, which are
influenced by human decisions.


2.4.2 Convolutional-based methods 


Currently, the state of the art on the MVTEC-AD dataset is heavily occupied by 
convolutionally based methods. Some notable examples are PatchCore [46], 
ReConPatch [47] and PaDiM [48], all utilizing a pre-trained CNN in different 
ways. 

PatchCore makes use of a memory bank to store intermediate (mid-level)
feature representations from the pre-trained CNN, using an efficient greedy
subsampling method that defines the coreset to be saved. Mid-level features
are used to avoid features that are too geared towards the classification
task of the ImageNet pre-trained CNN. At test time, an image is fed through
the pre-trained network and its features are compared to the memory bank
with the nearest neighbor algorithm; the anomaly score and segmentation are
derived from that comparison. It increases the speed of inference with the
proposed locally aware patch feature, in addition to the assumption that if
one patch is anomalous the whole image can be classified as anomalous [46].

A more recent but similar approach is ReConPatch by Hyun et al., which also
uses a memory bank but differs in the features that are subsampled into the
coreset. ReConPatch trains a linear modulation of patch features extracted
from the pre-trained CNN, utilizing contrastive representation learning to
achieve easily separable feature representations. At test time, it performs
the same nearest neighbor algorithm as its PatchCore counterpart [47]. Both
are examples of more advanced distance-based anomaly detection methods,
going further than just using the raw output of pre-trained networks for
distance-based metrics.

PaDiM is a probability-based method with a similar patch focus as the
previous work, fitting a Gaussian distribution to patch-level features from
a pre-trained CNN. For inference, it assigns anomaly scores on a patch level
depending on the Mahalanobis distance of the patch features to the estimated
Gaussian [48].
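
For reference, the Mahalanobis distance used for this kind of scoring has the standard form below. The symbols are notation introduced here for illustration and are not taken from [48]: x is the feature vector of a test patch, and μ and Σ are the mean and covariance of the Gaussian fitted to normal features at that patch position.

```latex
M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}
```

A larger M(x) means the patch features lie further from the fitted normal distribution, so the distance can serve directly as a patch-level anomaly score.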

A move away from the common reconstruction-based methods can be 
observed in the recent state-of-the-art CNN anomaly detection methods. This 
is partly because of an inherent problem with the convolution mechanism, as 
it is difficult to learn rich generalizable representations when the sample size 
is not large enough. The convolutions affect more than just the anomalous 
area; therefore, in a reconstructive setting, it can be hard to distinguish small 
anomalous regions from their background with simple convolution operations 
[49]. 


2.4.3 Self-attention or Transformers with CNNs 


With the success of CNNs and the promising performance of ViT for anomaly
detection, a reasonable idea is to combine them to utilize their advantages,
with the hope of mitigating their drawbacks. Initial attempts such as SIVT
[50], UTRAD [51] and ITran [42] all utilize a pre-trained CNN and
subsequently pass its features along to their different transformer
architectures for anomaly detection. SIVT applies the proposed
self-induction paradigm, which introduces a learnable auxiliary induction
sequence. This is subsequently fed to the transformer encoder, the features
are later reconstructed using a transformer decoder, and the reconstruction
error in the features is used for training and testing [50]. UTRAD feeds the
CNN output to a U-Net [52] architecture with transformer encoders; training
and testing use the reconstruction error of the reconstructed CNN features
[51]. ITran is a smaller network than SIVT but quite similar, differing
slightly in how the features are transformed before propagating through the
transformer encoder.

Chapter 3
Methods 


3.1 Dataset description 


The dataset is constructed by Mycronic by logging the servo measurements
on mask writers during the exposure step seen in Figure 2.3. In industrial
linear robotics, to which mask writers are analogous, the logs can be
interpreted as heatmaps or images over the main substrate axes, denoted x
and y. Examples from the dataset with anomalies highlighted can be seen in
Figure 3.1.

Through collaboration with system experts, all abnormal heatmaps
are annotated pixel-wise. This is a particularly difficult process without
specialized knowledge, since certain patterns may appear anomalous but are
normal. An important detail is that Mycronic’s current best convolutional
model assisted in the annotation process; therefore, the ground truth has a
particular bias towards it. However, the annotation is not heavily reliant
on the model and can thus be utilized for evaluation with this caveat in
mind. Additionally, for this project, the type of anomaly is labeled. There
are many different types of anomalies, but for the sake of simplicity, this
project generalizes them into three types: points, lines, and regions. In
this project, point anomaly and local anomaly are used interchangeably. Line
and region anomalies are trickier: some could be classified as local, but
most would be classified as global anomalies.

The dataset consists of 289 normal logs and 151 abnormal logs. The normal
logs are divided into 237 logs for the training set and 52 logs for the
validation set, aiming for a conventional 80-20 split between training and
validation. Due to the unsupervised setting, no abnormal data is used in
training, and therefore all abnormal logs are used for the test set. Note
that this distribution of abnormal logs does not represent the true incident
rate; it merely reflects the available data and the implicit threshold
chosen by expert labelers. Additionally, logs that are labeled as anomalous
do not consist purely of anomalies; rather, they contain a large share of
normal patches and pixels as well.


(a) Point anomaly (b) vertical line anomaly 


Figure 3.1: Examples of logs in the dataset, with anomalies highlighted. 


3.1.1 Patches 


The objective of detecting anomalies in this setting differs from that in
common datasets like MVTEC-AD in three primary factors: the resolution, the
relative size of the smallest anomalies, and the varying image sizes.

The MVTEC-AD dataset consists of images with a resolution in the range of
700 x 700 to 1024 x 1024, sourced from a regular RGB industrial inspection
camera. Most research tends to crop and downsample the images to a
standardized size of 256 x 256 pixels. Conversely, the servo logs for a 2D
substrate in a linear industrial robot can have an arbitrarily large
resolution, contingent on the sampling frequency used. In the case of a mask
writer, the maximum resolution is around 2000 x 7000 pixels. Such
high-resolution data could be downsampled, but this would introduce an
increased risk of error.

Firstly, this dataset contains tiny anomalies, similar to the point
anomaly shown in Figure 3.1a. Downsampling almost guarantees averaging
out these small anomalies. Secondly, precise localization of anomalies is
important to fully utilize the data. Thirdly, the image sizes vary: in a
typical industrial setting, like most examples in MVTEC-AD, the same camera
is used for the same product in a repeating process. However, different
masks have different sizes and different sampling frequencies, which results
in widely different-sized jobs to process.

One way to alleviate these problems is the sliding window approach, a
common practice in the medical imaging industry, which also suffers from
varying image sizes [52]. Instead of treating the image as a single entity,
it is divided into several smaller patches, such as 32 x 32 or larger
256 x 256, with possible overlaps between them. To accommodate image sizes
that are not divisible by the patch size, the less important edges are cut
off. This approach also enables a simple anomaly localization algorithm that
assigns the anomaly score from the model to each patch, which requires
smaller patches for sufficient granularity. The model chosen in this project
does not require this and can easily produce pixel-wise anomaly scores;
therefore, a larger patch size is utilized. A simple illustration of the
process can be seen in Figure 3.2.
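
The patch division described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the function names are hypothetical, the log is assumed to be stored as a list of rows, and cutting the leftover pixels at the bottom and right edges is an assumption, since the text only states that the less important edges are cut.

```python
def patch_origins(height, width, patch_size=256, stride=256):
    """Return the top-left (row, col) corner of every full patch.

    Patches that would run past the border are skipped, which amounts to
    cutting away the leftover edge. Which edge is cut (here: bottom and
    right) is an assumption of this sketch.
    """
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]


def extract_patch(log, origin, patch_size=256):
    """Slice one square patch out of a 2D log stored as a list of rows."""
    r, c = origin
    return [row[c:c + patch_size] for row in log[r:r + patch_size]]


# A log of roughly the maximum size mentioned in the text.
origins = patch_origins(2000, 7000)
print(len(origins))  # 7 rows x 27 columns of patches = 189
```

A stride smaller than the patch size gives overlapping patches; with the stride equal to the patch size, as here, the patches tile the log without overlap.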


3.2 Model selection 


The literature review showed that there are many ways to use transformers
for anomaly detection. Even the state-of-the-art convolutional-based methods
all incorporate a type of patch-based technique, which is inherent to vision
transformers. Since this project focuses specifically on vision transformers
for AD, a pure transformer-based model is chosen. Furthermore, because of
the small amount of available data and its nature, a model that needs less
data and does not focus solely on semantic anomaly detection is favored.
With this in mind, the most promising model seems to be the inpainting
transformer (InTra) from the paper by Pirnay and Chai [44].

Figure 3.2: A simple illustration of the patch division. In real examples, it
covers the whole 2D servo log.


3.2.1 InTra 


As previously mentioned, InTra consists of a stack of n transformer encoder
blocks connected with long residual connections, as can be seen in Figure
3.4. Each transformer encoder alternates between two layers: the adapted
multi-head self-attention layer and an MLP block. Residual connections
are applied after every block inside the transformer encoder, and layer
normalization is applied before each block; the architecture is visualized
on the right side of Figure 2.4. The network is trained by randomly sampling
batches of patch windows with a fixed side length. In each window, a patch
is randomly selected and inpainted by the network; this selection can be
seen in Figure 3.3. For the loss function, the original patch and its
reconstruction are compared using pixel-wise MSE loss, SSIM and GMS.
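
The window and patch selection can be illustrated with a short sketch. It only covers the sampling step, not the network, and rests on assumptions: the function name is hypothetical, the input is assumed to be divided into a regular grid of small patches, and the grid and window sizes in the usage lines are illustrative values that are not taken from this thesis.

```python
import random


def sample_inpainting_task(grid_rows, grid_cols, window_side, rng=random):
    """Pick one square window of patches and one patch inside it to inpaint.

    Returns the grid position of the target patch and the positions of the
    remaining context patches in the window.
    """
    if window_side > min(grid_rows, grid_cols):
        raise ValueError("window does not fit inside the patch grid")
    top = rng.randrange(grid_rows - window_side + 1)
    left = rng.randrange(grid_cols - window_side + 1)
    window = [
        (top + i, left + j)
        for i in range(window_side)
        for j in range(window_side)
    ]
    target = rng.choice(window)  # the patch that is covered and reconstructed
    context = [p for p in window if p != target]
    return target, context


# Illustrative sizes only: a 16 x 16 grid of patches and a 7 x 7 window.
target, context = sample_inpainting_task(16, 16, 7, random.Random(0))
print(target, len(context))
```

The network would then receive the context patches with their positions and be trained to reproduce the pixels of the target patch.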


3.3 Model comparison 


InTra is tested against two of Mycronic’s best performing convolutional
models, a Dual Autoencoder Generative Adversarial Network (DAGAN) [53]
based model and a U-Net [52] based model. Two categories of comparisons
are made: quantitative, which includes ROC, PRO, and expert ranking of
heatmaps, and qualitative, which looks at heatmaps with expert analysis. The
quantitative comparisons on ROC and PRO are the mean, over the individual
runs, of each run’s global mean over the logs.

Figure 3.3: The patch selection process of InTra, as shown in [44].


Figure 3.4: The transformer block architecture of InTra, the blocks are
connected with long residual connections. L is the size of the neighborhood
of patches, K is the size of these patches, C is the number of channels and D
is the dimension of the latent space that the neighborhood patches and their
positional information are mapped onto. From [44].

The majority of the metrics are quite straightforward, but the expert
ranking of heatmaps needs further explanation. Since individual comments
and analysis on each heatmap would be superfluous, the expert rank is an
attempt to draw broader conclusions from these expert analyses. At the time
of the comparison, only one expert was available for ranking. The expert is
presented with n different heatmaps produced by n different models and gives
them a relative rank, R_relative, from 1 to n. If outputs are considered
equal, their ranks are weighted so that the sum of ranks is always the same,
namely the triangular number for n. Further, for each set of heatmaps
ranked, an additional "versus score" of 0 to 3 is given for that set.


* 0: heatmaps equal.

* 1: heatmaps functionally the same, but visually better for one.

* 2: heatmap functionally better for one.

* 3: one model succeeds where the other fails.


This allows for a weighted rank, calculated as
R_weighted = max(versus_score, 1) * R_relative, which penalizes models that
perform worse in the qualitative aspect.
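
The ranking scheme can be made concrete with a small sketch. It is an illustration under assumptions: the function names are hypothetical, and ties are resolved by giving tied outputs the average of the positions they occupy, which is one way to keep the sum of ranks equal to the triangular number for n.

```python
def tie_adjusted_ranks(placements):
    """Turn expert placements into ranks whose sum is always n(n+1)/2.

    placements: one number per model, where equal outputs share a number,
    e.g. [1, 1, 3]. Tied entries receive the mean of the positions they
    occupy (an assumption of this sketch).
    """
    order = sorted(range(len(placements)), key=lambda i: placements[i])
    ranks = [0.0] * len(placements)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and placements[order[end + 1]] == placements[order[pos]]):
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def weighted_ranks(placements, versus_score):
    """R_weighted = max(versus_score, 1) * R_relative for one heatmap set."""
    factor = max(versus_score, 1)
    return [factor * r for r in tie_adjusted_ranks(placements)]


print(tie_adjusted_ranks([1, 1, 3]))   # [1.5, 1.5, 3.0], sum is 6
print(weighted_ranks([2, 1, 3], 3))    # [6.0, 3.0, 9.0]
```

Because the factor is max(versus_score, 1), versus scores of 0 and 1 leave the relative rank unchanged, while scores of 2 and 3 amplify the rank differences for sets where the models differ functionally.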


3.3.1 Implementation details 


Due to time and resource restrictions, only light, manual hyperparameter
tuning is done by inspecting the loss graph and the resulting heatmaps. With
these restrictions in mind, the goal was to set reasonable values mainly for
the learning rate and the number of epochs. Initially, early stopping based
on the validation loss was used, but upon further investigation the
validation loss was seen to rise for a couple of epochs and then drop below
its earlier level. It was therefore decided to fix the number of epochs
after inspecting the loss graph and heatmap outputs. The Adam optimizer was
used, and many of the parameters were taken from the original InTra paper.
The main parameters are listed in Table 3.1. Two notable differences in
implementation from the original paper are the training loss function and
the training data preparation. The loss function uses Multiple Scale
Gradient Magnitude Similarity (MSGMS) instead of its simpler predecessor
GMS, which is used in the original paper; Zavrtanik et al. show improved
results using MSGMS over GMS [26]. The second difference is in the training
phase, where the original paper augments the dataset using rotation and
flipping. As the mask writer has a direction of writing, the data is not
invariant to rotation and flipping; therefore, these augmentations were
skipped, and potential increases to the number of epochs were explored as an
attempt to counteract this.


Setting         | Name                   | Value     | Notes
Patch partition | Patch size             | 256 x 256 |
                | Stride                 | 256       | Same for training and inference
InTra full      | Learning rate          | 0.0001    | Same for all variations
                | Transformer blocks     | 13        |
                | Attention heads        | 8         | Same for all variations
                | Hidden layer dimension | 512       | Same for all variations
                | Epochs                 | 200       | Same for all variations
InTra medium    | Transformer blocks     | 7         |
InTra small     | Transformer blocks     | 4         |


Table 3.1: Hyperparameters used.

To measure InTra’s performance on the dataset, three different-sized
models are created: InTra full, InTra medium, and InTra small. InTra full has
the same architecture and size as in the original paper with ~55M
parameters, InTra medium has fewer transformer blocks and ~29M parameters,
and InTra small has even fewer transformer blocks and ~19M parameters.

The implementation was done using TensorFlow 2.14 and run on two
Nvidia RTX 3080 Ti GPUs and one Nvidia RTX A3000 GPU.


3.4 Statistical methodology 


Statistical tests are used to determine whether the differences in
performance between the compared settings are significant and not due to
chance. These tests are done with a significance level of 5%. To compare
multiple groups simultaneously, Welch’s one-way ANOVA [54] is used, which
tests the null hypothesis that the averages of all compared groups are
identical. Unlike traditional ANOVA, Welch’s does not assume that the
compared groups have the same variance, although it does assume a Gaussian
distribution of the variables. Given that the training has converged to a
certain degree, it is reasonable to assume the performance metrics are
normally distributed. However, to ensure comparability, the training
parameters (learning rate, epochs, etc.) are kept the same for variations of
investigated settings. Since these are set by manual means, some settings
may converge faster than others. Therefore, it is not reasonable to assume
that all variances are the same, and Welch’s ANOVA should be preferred over
traditional ANOVA. Welch’s ANOVA allows us to test the null hypothesis that
all compared groups have the same mean, but it does not indicate which
groups have significant differences.
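
As an illustration of the test, the sketch below computes the Welch ANOVA statistic with the standard textbook formula, using only the Python standard library. It is not the code used in this project: the function name is hypothetical and the AUC values in the usage lines are made-up placeholders, not results from this thesis. The p-value step assumes that SciPy is installed and is skipped otherwise.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) of Welch's one-way ANOVA.

    groups: list of samples, one list of measurements per group. Every
    group needs at least two values and a non-zero variance.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    w = [n_i / variance(g) for n_i, g in zip(n, groups)]  # n_i / s_i^2
    w_sum = sum(w)
    grand = sum(w_i * m_i for w_i, m_i in zip(w, m)) / w_sum
    between = sum(w_i * (m_i - grand) ** 2 for w_i, m_i in zip(w, m)) / (k - 1)
    lam = sum((1 - w_i / w_sum) ** 2 / (n_i - 1) for w_i, n_i in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) / (k ** 2 - 1) * lam)
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Placeholder values for three settings with three runs each.
runs = [[0.970, 0.980, 0.975], [0.880, 0.900, 0.870], [0.890, 0.910, 0.900]]
f_stat, df1, df2 = welch_anova(runs)

try:  # the p-value needs the F distribution, taken here from SciPy
    from scipy.stats import f as f_dist
    p_value = float(f_dist.sf(f_stat, df1, df2))
except ImportError:
    p_value = None
print(f_stat, df1, df2, p_value)
```

With only two groups, the statistic reduces to the square of Welch's t-value, which is a convenient sanity check for an implementation.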

For individual group comparisons the Games-Howell test [55] is needed,
which assumes normally distributed variables but not equal variances. The
Games-Howell test is defined as:


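$$\bar{x}_i - \bar{x}_j > q_{\sigma, k, df} \tag{3.1}$$


where q is Tukey’s studentized range distribution, k is the number of groups
and σ is equal to the standard error shown below:


$$\sigma = \sqrt{\frac{1}{2}\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)} \tag{3.2}$$

Welch's correction is used to calculate the Degrees of Freedom (Df):

$$df = \frac{\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)^2}{\frac{\left(s_i^2 / n_i\right)^2}{n_i - 1} + \frac{\left(s_j^2 / n_j\right)^2}{n_j - 1}} \tag{3.3}$$


Welch's t-test is then used to get the t-value:


$$t = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}}} \tag{3.4}$$


With these variables, the confidence intervals can be constructed like so:


$$\bar{x}_i - \bar{x}_j \pm t \sqrt{\frac{1}{2}\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)} \tag{3.5}$$


Lastly, to get the p-values we use Tukey’s studentized range:


$$q_{t\sqrt{2}, k, df} \tag{3.6}$$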
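
The pairwise procedure can be sketched as follows: Welch's t-value and Welch-corrected degrees of freedom are computed per pair, and the p-value is read from the studentized range distribution evaluated at t·√2. This is a minimal illustration rather than the code used in this project. The function name and the example values are hypothetical, and the p-value relies on `scipy.stats.studentized_range` (available in SciPy 1.7 and later); without SciPy the sketch returns `None` for it.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

try:  # SciPy is optional here; it is only needed for the p-value
    from scipy.stats import studentized_range
except ImportError:
    studentized_range = None


def games_howell(groups):
    """Pairwise Games-Howell comparisons.

    groups: dict mapping a label to a list of measurements. Returns a dict
    keyed by label pairs with (mean difference, t, df, p-value).
    """
    k = len(groups)
    results = {}
    for (name_i, x_i), (name_j, x_j) in combinations(groups.items(), 2):
        v_i = variance(x_i) / len(x_i)  # s_i^2 / n_i
        v_j = variance(x_j) / len(x_j)  # s_j^2 / n_j
        diff = mean(x_i) - mean(x_j)
        t = abs(diff) / sqrt(v_i + v_j)  # Welch's t-value
        df = (v_i + v_j) ** 2 / (
            v_i ** 2 / (len(x_i) - 1) + v_j ** 2 / (len(x_j) - 1)
        )  # Welch's correction for the degrees of freedom
        p = None
        if studentized_range is not None:
            p = float(studentized_range.sf(t * sqrt(2), k, df))
        results[(name_i, name_j)] = (diff, t, df, p)
    return results


# Made-up placeholder values, not results from this thesis.
example = {"A": [0.97, 0.98, 0.975], "B": [0.88, 0.90, 0.87], "C": [0.89, 0.91, 0.90]}
for pair, stats in games_howell(example).items():
    print(pair, stats)
```

A pair is then considered significantly different when its p-value falls below the chosen significance level.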

Chapter 4


Results 


The main question this project aims to answer is how transformer-based
models perform compared to convolutional-based models. To investigate this,
we start by inspecting some qualitative data. In the next section, we move
on to quantitative data, such as ROC, PRO, and expert scores, and
subsequently perform statistical tests on the AUC measurements.
Additionally, we investigate different model sizes and how different losses
affect the performance of the model.


4.1 Qualitative 


To gain a more visual understanding of the compared models, we start by 
inspecting various heatmaps. These are not all of the heatmaps, but rather 
a few showcasing interesting behaviors, as going through all of the heatmaps 
would be superfluous for this report. In an attempt to get an overview of the 
heatmap results an expert rank of the heatmaps is calculated and displayed 
under the quantitative part of the results in Table 4.3. 

(a) Region anomaly, where DAGAN, U-Net, and InTra are equally good.
(b) Line anomaly, where both lines are slightly easier to distinguish with InTra.
(c) Smaller region anomaly, which is easier to distinguish with InTra.


Figure 4.1: Qualitative results.

First, Figure 4.1 shows some examples of InTra performing similarly to or
slightly better than its convolutional counterparts, DAGAN and U-Net. The
difference is mostly that the anomalies are slightly easier to distinguish
in InTra’s output. From left to right, the panels show the anomaly score
output of DAGAN, U-Net, InTra full and InTra medium, and finally the ground
truth (GT) on the right-hand side. In Figures 4.1b and 4.1c there appears to
be more output from DAGAN and U-Net at the very top. This is due to the
patch division of the input data: the dimensions of that extra input data
are greater than the convolutional models’ patch size of 64x64 but smaller
than InTra’s 256x256. It is therefore not included in the InTra outputs.

(a) Point anomaly where InTra has worse contrast.
(b) InTra performing worse on line of local anomalies.


Figure 4.2: Qualitative results.

Figure 4.2 shows some examples of InTra performing worse than DAGAN and
U-Net. The contrast in Figure 4.2a is bad enough to deem it a failed
detection, whereas in Figure 4.2b the anomalies are just slightly less
distinguishable with InTra. Once more, both figures show the effect of the
patch division of the input, emphasized more in Figure 4.2b.


(a) InTra is the first model to weakly detect a known point anomaly. Panels:
DAGAN, U-Net, InTra full, InTra medium.

(b) Zoomed in on the area of interest for the point anomaly. Panels: InTra
full, InTra medium, GT.


Figure 4.3: Point anomaly, where InTra weakly finds the anomaly and DAGAN
and U-Net fail, with additional zoomed-in figures on areas of interest.


InTra is the first model to detect a known local/point anomaly, which can
be seen in Figure 4.3. It is particularly tricky since it is a contextual
anomaly: the values by themselves are not anomalous, but in this context
they are. The anomaly score is not easily distinguishable, so Figure 4.3b is
provided to aid the viewer in observing the detected anomaly.

Overall, InTra’s results are comparable to and, in some cases,
qualitatively better than the convolutional baseline of DAGAN and U-Net.
InTra seems to perform worse on a few local anomalies but performs better
on certain images of a particularly difficult subset of the data.

(a) Heatmap comparison of different loss functions as anomaly score.
(b) Heatmap comparison of different loss functions for training. The anomaly score
for the heatmaps is produced by the MSE.


Figure 4.4: Heatmap comparison on different loss functions for anomaly score
and their combinations with MSE. The input is the same for both (a) and (b).


Moving on to the qualitative loss comparison, Figure 4.4a compares the same
model trained with different losses, with each training loss also used as
the anomaly score. Figure 4.4b shows the same model trained with MSE jointly
with another loss and with MSE as the anomaly score function. Looking at
Figure 4.4a, MSE seems crucial for obtaining meaningful and high-quality
heatmaps. Purely basing the anomaly score on SSIM, cosine similarity or
Kullback-Leibler divergence (KL) loss produces lackluster results. MSGMS has
decent performance but lacks consistency when looking over several logs. The
KL, SSIM and cosine similarity losses show worse results overall, both by
themselves in Figure 4.4a and combined with MSE in Figure 4.4b.


4.2 Quantitative 


To quantitatively compare InTra and its different settings to the convolutional 
models, each setting is trained and evaluated three times. 


(a) ROC AUC values for the different models with light grey bars showing standard
deviation.

(b) PRO AUC values for the different models with light grey bars showing standard
deviation.


Figure 4.5: Quantitative comparisons of models.

In Figure 4.5, the mean pixel-wise AUC of ROC and PRO over the three runs
is shown as dots and the standard deviation is shown as grey bars. DAGAN
and U-Net perform best, with InTra medium being the best of the InTra models
in both ROC and PRO.


Anomaly type\model | InTra full | InTra medium | InTra small | DAGAN | U-Net
Point              | 0.873      | 0.894        | 0.890       | 0.980 | 0.978
Region             | 0.627      | 0.680        | 0.665       | 0.916 | 0.934
Line               | 0.663      | 0.723        | 0.707       | 0.934 | 0.925


Table 4.1: ROC AUC metrics for different anomaly types.


Anomaly type\model | InTra full | InTra medium | InTra small | DAGAN | U-Net
Point              | 0.764      | 0.797        | 0.792       | 0.959 | 0.953
Region             | 0.291      | 0.375        | 0.354       | 0.828 | 0.837
Line               | 0.383      | 0.481        | 0.460       | 0.832 | 0.808


Table 4.2: PRO AUC metrics for different anomaly types.


As previously mentioned, the dataset can broadly be described as consisting
of three anomaly types: point, region, and line anomalies. Inspecting the
models with this additional information can provide insight into the
strengths and weaknesses of a model. The performance of the different models
can be seen in Table 4.1 for ROC AUC and Table 4.2 for PRO AUC. DAGAN and
U-Net have the best performance in both metrics, and InTra medium is the
best of the InTra models. All models perform better on point anomalies and
worse on region and line anomalies. DAGAN performs best on point and line
anomalies across both metrics, and InTra follows a similar pattern of being
better at line anomalies than at region anomalies. U-Net outperforms both on
region anomalies in both the ROC and PRO metrics.

As shown by the qualitative comparisons, MSGMS performs quite well
both on its own and in combination with MSE. One question this report aims
to answer concerns losses for transformers in this new data setting: can
MSGMS aid MSE in producing higher-quality heatmaps or better performance
metrics, or is MSE a superior loss function that still performs well despite
MSGMS? To investigate this, the different-sized InTra models are trained
using either solely MSE or a joint loss function of MSE + MSGMS. Each
setting is repeated for three runs and presented in Figure 4.6.

Figure 4.6: Quantitative comparisons of models.


Initial observations indicate that MSE + MSGMS performs better than MSE
alone. Another observation concerns how the difference between the two loss
settings changes as the size of the InTra model decreases. For two out of
three variants, pure MSE has noticeably higher variance than its
counterpart.


Expert ranking\Model | DAGAN  | U-Net  | InTra full | InTra medium | InTra small
Relative Both        | 4.548  | 4.506  | 4.440      | 4.398        | 4.440
Relative MSE         | 4.548* | 4.506* | 4.873      | 4.398        | 4.398
Weighted Both        | 5.133  | 5.042  | 5.054      | 5.012        | 5.054
Weighted MSE         | 5.133* | 5.042* | 5.452      | 5.012        | 5.012


Table 4.3: Average expert ranking of heatmaps. The rows labeled Relative
hold the relative ranking and the rows labeled Weighted hold the weighted
ranking; Both denotes the MSE + MSGMS loss. For all rows, a lower score is
better. Best values are highlighted in bold. * Same as above since DAGAN and
U-Net are not included in the training loss comparison.


The expert ranking shows InTra slightly better than DAGAN and U-Net,
although some variations of InTra perform worse than both DAGAN and U-Net.
InTra medium performs the best on both MSE + MSGMS and pure MSE. The MSE +
MSGMS loss performs better on the full variant but worse on the small InTra
variant; the improvement on the full variant is larger than the
deterioration on the small variant.


(b) PRO AUC values for the different models.

4.2.1 Statistical analysis


To check if the above results are significant, Welch’s ANOVA and the
Games-Howell test are employed.

The main question of this project concerns the comparison of
transformer-based and convolutional-based models. To support this
statistically and check the validity of the results found, we perform
Welch’s ANOVA test on the three InTra models and two convolutional models.
We compare the calculated p-values with a significance level of α = 0.05.
Results can be seen in Table 4.4.


Metric  | p-value
ROC AUC | 2 · 10^-?
PRO AUC | 5 · 10^-6


Table 4.4: p-values for Welch’s ANOVA with the null hypothesis that the
averages of InTra models and convolutional models are not different. Values
below α are highlighted.


We see that for both metrics there is a statistically significant difference
between the models. To see between which model types/settings the difference
lies, the post hoc Games-Howell test is used, and the results are shown in
Table 4.5.


Comparison\Metric | ROC AUC      | PRO AUC
DAGAN vs Full     | 1.62 · 10^-? | 6.56 · 10^-4
DAGAN vs Medium   | 3.80 · 10^-5 | 4.20 · 10^-?
DAGAN vs Small    | 1.48 · 10^-? | 4.44 · 10^-4
Full vs Medium    | 2.92 · 10^-? | 2.15 · 10^-?
Full vs Small     | 7.82 · 10^-? | 6.53 · 10^-?
Full vs U-Net     | 1.89 · 10^-? | 1.63 · 10^-4
Medium vs Small   | 1.86 · 10^-1 | 3.57 · 10^-1
Medium vs U-Net   | 8.35 · 10^-4 | 1.31 · 10^-?
Small vs U-Net    | 3.84 · 10^-3 | 5.60 · 10^-?


Table 4.5: p-values of Games-Howell testing the null hypothesis that the
compared InTra models and convolutional models do not differ. Values below
α are highlighted.


As can be seen in Table 4.5, all but two comparisons are significantly 
different, namely Full vs Small and Medium vs Small. This makes it harder to 
derive significant conclusions about model size and performance from these 
tests. 

To investigate the second question of this project, MSE vs MSE +
MSGMS, a similar approach is taken, starting with Welch’s ANOVA test
shown in Table 4.6.


Metric  | p-value
ROC AUC | 3.18 · 10^-2
PRO AUC | 4.23 · 10^-4


Table 4.6: p-values for Welch’s ANOVA with the null hypothesis that the
different loss settings for InTra are not different. Values below α are
highlighted.


Once again, Welch’s ANOVA shows that there is a significant difference
but not where; therefore, the Games-Howell test is used, with results shown
in Table 4.7.


Comparison\Metric         | ROC AUC      | PRO AUC
Full both vs Full MSE     | 4.94 · 10^-? | 7.03 · 10^-?
Full both vs Medium MSE   | 9.59 · 10^-1 | 9.99 · 10^-1
Full both vs Small MSE    | 6.37 · 10^-? | 5.62 · 10^-?
Full MSE vs Medium both   | 3.11 · 10^-? | 6.05 · 10^-?
Full MSE vs Small both    | 4.03 · 10^-? | 1.37 · 10^-?
Medium both vs Medium MSE | 8.32 · 10^-1 | 8.40 · 10^-1
Medium both vs Small MSE  | 4.75 · 10^-1 | 8.41 · 10^-1
Medium MSE vs Small both  | 9.55 · 10^-1 | 9.11 · 10^-1
Small both vs Small MSE   | 6.56 · 10^-1 | 2.90 · 10^-1


Table 4.7: p-values of Games-Howell testing the null hypothesis that the
compared loss settings for InTra models do not differ. Values below α
are highlighted.


Table 4.7 shows that only one comparison within the same model size is
significantly different, along with two loss comparisons across different
model sizes. There is a statistically significant difference between InTra
full trained purely with MSE and trained with the combination of MSE +
MSGMS. However, the same comparison shows no statistically significant
difference for InTra medium and small. A larger sample size could help us
conclude with more confidence regarding InTra's performance with different
training losses.

Chapter 5


Discussion 


5.1 Key findings and their validity 


* Good qualitative performance on smaller anomalies with bigger patch 
windows. 


* Similar performance to DAGAN and U-Net and on some subsets of the 
data slightly better than its convolutional counterparts. 


* InTra is the first to find known anomalies that current convolutional
models fail to find.


* Minor improvement in performance with MSGMS used in conjunction 
with MSE for defect detection-like tasks. 


The main problem with the experimental setup of this thesis project is the
small number of samples for the repeated experiments. This is a problem
because chance has a greater influence on the results when each run makes up
a larger part of the average. There were some reasons for the smaller sample
size, mainly that InTra was implemented later in the project than expected
due to a pivot after failed attempts with other models. The later
implementation, in combination with time and resource limitations, led to
the current situation. However, we believe that there is value in the
findings and that the limited sample size mainly affects the loss
comparison. In the following paragraphs, we further discuss the findings and
their limitations.

Initial observations indicate that the joint loss function performed better
than MSE alone, but this was statistically significant in only one out of
three comparisons in the quantitative analysis. Further studies would
therefore be needed to make more confident assertions about the potential
benefits of MSGMS from a quantitative standpoint. The qualitative analysis,
however, is less affected by the limited sample size. The heatmaps were
inspected side by side to pick the best run, but no best run could be
identified since all three runs looked practically identical for each InTra
setting. It seems highly unlikely that all 18 runs would be outliers while
looking practically identical within each group of three samples. This is
further supported by the absence of any perceived outlier run throughout
the implementation and other testing of InTra. We therefore believe that
InTra's performance was robust. The expert ranking scores in Table 4.3
showed no difference for InTra medium, a very minor advantage for pure MSE
in InTra small, and a greater difference for InTra full in favor of
MSE + MSGMS. The difference in favor of MSE + MSGMS is an order of
magnitude greater than that in favor of pure MSE, which agrees with the
trends seen in the quantitative analysis. Taking the quantitative and
qualitative analyses of the different loss functions together, MSGMS
appears to aid anomaly detection with InTra, especially in the larger
version of the model.

Let's revisit the research question of this thesis: What performance can be 
achieved using vision transformers for anomaly detection on servo logs? 


* Can it be useful by performing better than existing, convolutional-based 
methods, in some aspect? 


* Can vision transformers help with the local-global tradeoff? 
* What are the best training losses for this task? 


A similar performance can be achieved with a 4x increase in patch
size. InTra performs worse in the quantitative comparison and slightly
better in the qualitative one. Given the inherent bias towards the
convolutional models in the quantitative comparison and InTra's slightly
better performance in the more task-specific metric of heatmap quality, we
conclude that InTra and the convolutional models performed similarly on the
dataset at hand. Moving on to the sub-questions, InTra performs better in
some aspects, namely finding new anomalies and producing slightly better
heatmaps. The motivating tradeoff problem, in which larger patch divisions
lead to worse performance on local anomalies, has been resolved, with InTra
performing on a similar level to the current convolutional models at
Mycronic. However, the hypothesized improvement in performance on global
anomalies was not achieved with InTra. We do believe that the increased
patch size with similar performance on local anomalies is a great first step
towards the goal of better global anomaly performance. Lastly, the
best-performing training losses are MSE and MSE + MSGMS.

The reasons that InTra medium and the joint loss perform better might
be related. As mentioned in Section 2, some of the literature seems to
indicate a performance plateau for reconstruction-based anomaly detection
models because they become too good at reconstructing [28][27]. The larger
InTra model could be "too good" at reconstructing, which would impede
its anomaly detection performance. The literature also sees a risk of worse
performance when only MSE is used on more complex datasets [29], so
the introduction of MSGMS could help with performance. As for model size,
InTra small could be too small to learn the normality well enough, InTra
medium hits the sweet spot between the two extremes, and InTra full becomes
too good at generalizing the reconstruction, presumably reconstructing
anomalies as well, and therefore performs worse at anomaly detection.


5.2 Design choices 


A design choice that is quite visible, most prominently in Figure 4.2b,
is the amount of dead space. This is due to InTra's larger patch size
compared to the convolutional models, 256 x 256 versus 64 x 64. As a result,
in places where data is missing from the input twice within a span of 256
units, InTra skips the region because of its larger patch size. This could
be remedied by adding some form of padding so that the model still produces
output in these places, but this was deemed superfluous since it did not
affect the results, which are our main concern.
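
The effect of patch size on dead space can be illustrated with a
one-dimensional sketch. It assumes, as a simplification that is not spelled
out in the text, that a stretch of the log can only be processed if it
contains a gap-free run at least one patch long. The position and length of
the gaps are hypothetical, and the function name is our own.

```python
def covered_cells(valid, patch):
    """Count valid cells lying in a gap-free run at least `patch` long."""
    covered = 0
    run = 0
    for cell in valid + [False]:  # sentinel closes the final run
        if cell:
            run += 1
        else:
            if run >= patch:
                covered += run
            run = 0
    return covered


# Hypothetical row of 1024 cells with two short gaps less than 256 apart.
row = [True] * 1024
for gap_start in (300, 500):
    for i in range(gap_start, gap_start + 8):
        row[i] = False

covered_64 = covered_cells(row, 64)    # every valid cell can be processed
covered_256 = covered_cells(row, 256)  # the run between the gaps is skipped
```

In this example the 192 valid cells between the two gaps become dead space
for the 256-wide patch, while the 64-wide patch still covers them.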

Another choice regarding the presentation of the results is the omission of
heatmap outputs from InTra small. This was mainly to save space and make
the figures more readable, as they are already hard to interpret at this
small size. Furthermore, Table 4.3 shows InTra full and InTra small
performing the same. InTra medium was included as the best-performing InTra
model and InTra full as the worst-performing one. The decision was therefore
made to exclude InTra small's heatmap output for readability and to avoid
superfluous information.


5.3 Ethics and sustainability 


AI and machine learning offer great opportunities for extending
human potential, making resource usage more efficient, and automating
repetitive tasks.




However, there are some potential risks from an ethical and sustainability
perspective. This project has explored and adapted vision transformers for
use in an industrial inspection setting. As in several other scientific
papers on industrial inspection, the ultimate aim is to increase
efficiency. On the one hand, this can be seen as positive from an ethical
and sustainability perspective, since fewer resources are needed if
production is more efficient, and workers could be relieved of monotonous
tasks. On the other hand, an increase in efficiency risks increasing
resource usage, a phenomenon often referred to as the Jevons paradox or the
rebound effect. It refers to situations where an increase in efficiency
results in a smaller-than-expected decrease in resource usage, and sometimes
even an increase. For example, a car's fuel efficiency may increase by 10%
while fuel usage decreases by less than 10% or even increases, possibly
because the lower cost of driving incentivizes more driving [56]. A more
applicable example would be an increase in production efficiency leading to
a lower price, which leads to higher demand and possibly more purchases of
the product. This is an important factor to keep in mind moving forward.
Nevertheless, increasing the efficiency of photomask production means that
fewer resources are needed to produce one unit, which is a net positive from
an environmental sustainability point of view. Lastly, regarding
sustainability, machine learning models need high-performance graphics cards
and can hardly be seen as environmentally friendly, both in terms of the
environmental impact of producing the hardware and the energy consumed by
training and inference. Training only happens once, so the bulk of the
energy consumption comes from inference. The inference time varied from 3 to
10 minutes depending on the model size and the size of the log. With a 350 W
RTX 3080 Ti, this conservatively results in 0.06 kWh per inference.
This is quite small, and taking into account the possible savings from not
having to rewrite a multitude of masks, the power consumption is most likely
worth it environmentally.
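
The energy estimate above can be reproduced with a short calculation. The
sketch assumes that the GPU draws its full rated 350 W for the entire
inference, which is what makes the figure conservative. The function name is
our own.

```python
def energy_kwh(power_watts, minutes):
    """Energy in kWh for a device drawing power_watts for the given minutes."""
    hours = minutes / 60
    return power_watts * hours / 1000


GPU_POWER_W = 350  # rated power of the RTX 3080 Ti, assumed drawn in full

# Shortest and longest inference times reported in the text.
low = energy_kwh(GPU_POWER_W, 3)    # about 0.018 kWh
high = energy_kwh(GPU_POWER_W, 10)  # about 0.058 kWh, roughly 0.06 kWh
```

The upper end of the range, 10 minutes at full power, gives the 0.06 kWh
quoted in the text after rounding.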

Another risk with AI/ML is that automation causes unemployment.
While this is a risk in general, it does not apply directly to the field of
industrial anomaly detection. The field relies on subjective inspection by
human experts, which is hard to translate into software, and machine
learning can support the experts in their inspection and possibly perform
inspections that humans cannot. Additionally, this project does not automate
existing tasks, but rather adds an extra analysis step, as seen in
Figure 2.3.

Chapter 6


Conclusions and Future work 


6.1 Conclusions 


This thesis has explored the use of vision transformers to detect and localize
anomalies in servo logs of mask writers. Transformers were chosen as a
contender to alleviate a tradeoff problem with convolutional models. Because
of its size, the input data is treated as an image and divided into
smaller patches, which has resulted in worse performance on smaller anomalies
with larger patch sizes. We approached this by implementing a reconstruction
model, InTra, which uses inpainting to learn the normality of the data and
uses the reconstruction error as both the loss and the anomaly score. We
trained InTra and evaluated it against two convolutional networks, DAGAN and
U-Net, using quantitative and qualitative evaluations. The quantitative
evaluation consisted of the ROC AUC and PRO AUC metrics, and the qualitative
evaluation consisted of expert scoring of the heatmaps produced by the models.
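
As an illustration of how a reconstruction error can serve as an anomaly
score and be evaluated with ROC AUC, the sketch below scores each pixel by
its squared reconstruction error and computes the AUC through its rank
interpretation: the probability that a randomly chosen anomalous pixel
receives a higher score than a randomly chosen normal one. The pixel values
and labels are hypothetical, the function names are our own, and this is not
the evaluation code used in the thesis.

```python
def anomaly_scores(original, reconstruction):
    """Per-pixel squared reconstruction error."""
    return [(o - r) ** 2 for o, r in zip(original, reconstruction)]


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties count as half. Quadratic in the number of pixels, so only
    suitable for small illustrations.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical flattened patch: label 1 marks anomalous pixels. One normal
# pixel (index 1) is reconstructed poorly and outranks an anomalous pixel.
original = [0.2, 0.3, 0.9, 0.4, 0.8, 0.1]
reconstruction = [0.2, 0.6, 0.3, 0.4, 0.6, 0.1]
labels = [0, 0, 1, 0, 1, 0]

auc = roc_auc(anomaly_scores(original, reconstruction), labels)
```

Seven of the eight anomalous-normal pixel pairs are ranked correctly in this
example, so the AUC is 0.875.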

As for the results, InTra shows similar performance to DAGAN and U-Net,
performing worse in the quantitative and slightly better in the qualitative
results. InTra is able to detect small anomalies even with a four times larger
patch size, and it even weakly finds a known local anomaly that the
convolutional models cannot.


6.2 Future work 


There are many possible further studies on the subject to confirm and extend
the findings of this thesis. A simple first step would be a fairer comparison
between convolutions and transformers, for example between a pure
convolutional U-Net structure and a pure transformer U-Net structure, or
other simple methods. By taking a step back in performance, one could
hopefully see the pure effects of transformers or convolutions without other
methods interfering. Another simple study would be a more rigorous
hyperparameter search, as well as running more repetitions to strengthen the
tentative MSGMS conclusion. Relating to the tradeoff problem, with more
resources a bigger patch size could be used to see whether the InTra results
still hold for 512 x 512 or even larger patches.

Another direction is to abandon reconstruction, and possibly even
transformers, in favor of the recent best-performing model on MVTec AD,
ReConPatch. Reconstruction models may be plateauing, which makes the
potentially new state-of-the-art distance-based anomaly detection methods
worth exploring. An important detail of ReConPatch's success is its new way
of using pre-trained networks, using parts of the whole network. This is
something that the current hybrid transformer/convolution models do not
utilize, and it might be worth studying further to improve their currently
lacking results.

Anomaly detection is the task of identifying data that deviates from a normal
set of data. When handling larger data, a common approach is to split the
data into equally sized patches to make it more manageable and better suited
as model input. A problem that arises when using [...] anomaly detection
problem; the transformer is a perfect candidate for this because of its
self-attention mechanism. After a literature study of the field, we chose to
implement InTra, a transformer that trains on reconstructing an obstructed
part of the image using nearby parts of the image. A fourfold increase in
patch size was achieved with performance similar to previous [...] improves
the local performance in some areas and manages to identify known anomalies
that [...] An interesting finding is the promising results of Multiple Scale
Gradient Magnitude Similarity (MSGMS), which performs worse on its own but
gives good results when used in combination with the common loss function
mean squared error (MSE).